A  WEIGHTED  DIFFERENCE  OF  ANISOTROPIC  AND  ISOTROPIC  TOTAL 
VARIATION  MODEL  FOR  IMAGE  PROCESSING 
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Abstract.  We  propose  a  weighted  difference  of  anisotropic  and  isotropic  total  variation  (TV)  as  a  regularization  for 
image  processing  tasks,  based  on  the  well-known  TV  model  and  natural  image  statistics.  Due  to  the  difference  form  of 
our  model,  it  is  natural  to  compute  via  a  difference  of  convex  algorithm  (DC A).  We  draw  its  connection  to  the  Bregman 
iteration  for  convex  problems,  and  prove  that  the  iteration  generated  from  our  algorithm  converges  to  a  stationary 
point  with  the  objective  function  values  decreasing  monotonically.  A  stopping  strategy  based  on  the  stable  oscillatory 
pattern  of  the  iteration  error  from  the  ground  truth  is  introduced.  In  numerical  experiments  on  image  denoising,  image 
deblurring,  and  magnetic  resonance  imaging  (MRI)  reconstruction,  our  method  improves  on  the  classical  TV  model 
consistently,  and  is  on  par  with  representative  start-of-the-art  methods. 
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1.  Introduction.  Many  image  processing  tasks  can  be  formulated  as  an  inverse  problem,  in 
which  the  data  /  is  assumed  to  be  obtained  approximately  by  applying  a  linear  operator  A  on  an 
image  u  with  additive  noise.  For  example,  A  is  the  identity  matrix  for  image  denoising,  a  convolution 
matrix  for  deblurring,  and  subsampling  of  Fourier  transform  for  a  magnetic  resonance  image  (MRI) 
reconstruction  problem.  In  most  scenarios,  solving  u  from  Au  —  f  is  ill-posed  in  the  sense  that  directly 
inverting  A  would  result  in  bad  and  possibly  multiple  solutions.  It  is  necessary  and  even  desirable  to 
constrain  the  solutions  through  regularization,  with  the  help  of  prior  knowledge  of  images  that  one 
wants  to  reconstruct.  A  general  model  for  such  inverse  problem  is 

w:=argminuJ(M)  +  |||Att-/||L  (1.1) 

where  J(u)  is  the  regularization  term,  /i  is  a  positive  parameter  to  balance  J(u)  and  the  data  fidelity 
term  || Au  —  f  |||,  and  u  is  an  optimal  solution  of  the  model  or  a  reconstructed  result.  A  classical 
regularization  is  the  total  variation  (TV)  proposed  by  Rudin-Osher-Fatemi  [33].  It  is  widely  used  in 
image  processing  applications,  such  as  deconvolution  [7,16,25],  inpainting  [6]  and  super-resolution  [26], 
just  to  name  a  few.  The  TV  model  originated  in  [33]  is  isotropic,  and  later  an  anisotropic  formulation 
has  been  addressed  in  the  literature  [10,30]  among  others.  We  give  mathematical  definition  for  both 
the  isotropic  and  anisotropic  TV  in  the  discrete  setting.  Denoting  u  as  the  column  vector  by  a 
lexicographical  ordering  of  a  2D  image,  we  have 

Jiso(u)  m  Vu\\2  \\yJ\Dxu\*  +  |D„u|2||i,  (1.2) 

Jani(u)  :=  \\Dxu\\i  +  ||A,u||l  ,  (1.3) 

where  Dx,Dy  denote  the  horizontal  and  vertical  partial  derivative  operators.  Throughout  this  paper, 
we  shall  use  notations  || Vtx|| 2  and  || yj \Dxu\ 2  +  |D2/i/|2||i  interchangeably. 

Another  interpretation  of  TV  can  be  given  from  the  perspective  of  compressive  sensing  (CS) 
[3, 12],  which  is  to  reconstruct  a  signal  from  an  under-determined  system  provided  that  the  signal  is 
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sufficiently  sparse  or  sparse  in  a  transform  domain.  For  example,  a  natural  image  is  mostly  sparse 
after  taking  gradient.  Mathematically,  it  amounts  to  minimizing  the  Lq  norm  of  the  image  gradient, 
z.e.,  J{u)  =  ||Vix||o.  To  bypass  the  NP-hard  L0  norm,  the  convex  relaxation  approach  in  CS  is  to 
replace  L0  by  Li,  and  L\  on  the  gradient  is  the  total  variation.  The  restricted  isometry  property 
(RIP)  condition  [3]  theoretically  guarantees  the  exact  recovery  of  sparse  solutions  by  L\.  The  RIP 
regime  is  where  the  sensing  matrix  is  incoherent ,  such  as  a  random  Gaussian  matrix.  Several  non- 
convex  penalties  have  been  proposed  and  studied  as  alternatives  to  Li,  [19].  A  few  notable  examples 
are  Lp  for  p  G  (0, 1)  [8,21,39],  L1/L2  (scale  invariant  L\)  and  L\  —  L2  [13,22,23,40,41].  In  particular, 
L\  —  L2  penalty  is  found  to  be  the  best  among  existing  methods  for  recovering  sparse  solutions  when 
the  sensing  matrix  is  highly  coherent  or  significantly  violating  the  RIP  condition  [23,41]. 

The  TV-regularization  has  been  a  very  active  research  topic  in  the  past  two  decades.  Though  a 
gradient  descent  approach  in  the  original  paper  can  be  slow  to  converge,  a  projection  algorithm  is  later 
proposed  by  Chambolle  [5]  to  speed  up  convergence  based  on  duality.  More  recently,  the  Bregman 
and  split  Bregman  methodology  [9,15,29]  offers  another  line  of  fast  algorithms  equivalent  to  the  role 
of  alternating  direction  method  of  multipliers  (ADMM)  and  Douglas-Rachford  splitting  algorithm  in 
the  optimization  literature  dating  back  to  the  1970’s.  There  are  also  a  few  approaches  to  solve  the 
Lq  minimization  directly.  In  [38],  a  special  alternating  minimization  strategy  with  half-quadratic 
splitting  is  adopted  for  image  smoothing.  Image  restoration  via  L0  is  considered  in  [31],  which  uses 
hard  shrinkage  for  Lq  as  opposed  to  soft  shrinkage  for  L\.  In  addition,  the  Lq  on  the  gradient  can  be 
interpreted  as  the  length  of  the  partition  boundaries,  which  leads  to  the  classical  Potts  model  [32]  or 
piece- wise  constant  Mumford-Shah  model  [28]  for  image  segmentation  or  partition.  Recently,  Storath 
et.  al.  [34]  propose  a  hybrid  ADMM  and  dynamic  programming  method  to  solve  the  Potts  model. 

Motivated  from  L\  —  L2  minimization  of  coherent  CS  [23,41],  we  propose  the  following  weighted 
difference  of  convex  regularization, 

J(u)  :=  Jani  -  aJiso  =  \\Dxu\\x  +  \\Dyu\h  -  a\\yJ\Dxu\2  +  |Uyu|2||i  ,  (1.4) 

where  a  G  [0, 1]  is  a  parameter  for  a  more  general  model.  When  a  =  1,  J{u)  is  to  apply  L\  —  L2 
on  the  gradient  vector.  Two  advantages  of  L\  —  L2  over  other  nonconvex  measures  are  its  Lipschitz 
regularity,  and  guaranteed  convergence  via  the  difference  of  convex  algorithm  (DCA)  [35,36].  We  find 
that  the  DCA  requires  solving  the  L\  type  of  minimization  as  a  subproblem,  which  can  be  handled 
efficiently  by  utilizing  the  split  Bregman  technique.  We  prove  that  the  DCA  approach  converges  to 
stationary  points,  a  typical  situation  for  nonconvex  problems.  In  practice,  the  DCA  iterations,  when 
properly  stopped,  are  often  close  to  global  minima  and  produce  excellent  results.  The  stopping  issue 
is  discussed  later  based  on  the  oscillatory  pattern  of  the  iteration  errors. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  describes  our  model  in  detail  including 
numerical  algorithms  and  convergence  analysis.  Section  3  is  devoted  to  numerical  experiments,  where 
three  image  processing  applications  (denoising,  deblurring  and  MRI  reconstruction)  are  examined. 
Finally,  discussions  and  conclusions  are  given  in  Section  4  and  Section  5  respectively. 

2.  Our  model.  Let  (ujX,  Ujy )  be  gradient  vector  at  pixel  j.  Then  equation  (1.4)  can  be  rewritten 
as 

J(u)  =  Yi  (K*l  +  \uJy\ -  +  u%)  •  (2-1) 

3 

This  point-wise  formulation  suggests  that  sparsity  is  enforced  on  every  gradient  vector.  More  specifi¬ 
cally,  we  encourage  the  gradient  to  be  1-sparse  at  every  pixel,  which  implies  that  horizontal  or  vertical 
edges  are  more  preferable  in  this  model.  In  order  to  understand  the  image  gradient  and  1-sparsity,  we 
plot  the  histogram  of  gradient  angles  over  the  range  of  [0,90]  degree  in  Figure  2.1  for  a  large  number 
of  natural  images.  The  angle  distribution  in  other  quadrants  is  similar.  As  shown  in  Figure  2.1,  the 
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Fig.  2.1.  The  histogram  of  gradient  angles  over  300  images  from  Berkeley  segmentation  dataset  [27]. 
peaks  are  at  0  and  90  degrees,  indicating  that  gradient  vectors  are  mostly  1- sparse. 


Two  largest 


two  largest  peaks  are  at  0  and  90  degrees,  which  implies  that  gradient  vectors  are  1-sparse  at  a  fairly 
good  chance,  with  non-sparse  occurrences  also  at  positive  probability.  Hence  we  insert  a  constant  a 
in  (1.4)  to  reflect  such  behavior  in  the  histogram.  In  Figure  2.2,  we  plot  the  level  lines  of  norm  on 
the  gradient,  whose  value  is  0  at  origin,  1  at  axes,  and  2  elsewhere.  The  level  lines  corresponding  to 
a  <  1  in  (1.4)  is  closer  to  Lq  than  that  of  a  =  1  in  the  sense  that  the  latter  yields  0  at  both  axes. 

Let  us  derive  the  value  of  a  based  on  the  gradient  distribution.  Suppose  that  the  gradient  value 

Dxu  follows  the  distribution  [21],  Le~|a:1P  where  T(t)  =  /0+°°  xt~1e~x.  It  is  Gaussian  distribution 

1  v  p  ) 

for  p  =  2,  Laplacian  distribution  for  p  =  1,  and  hyper-Laplacian  for  0  <  p  <  1.  We  have 


Ei 


e2 


E\Dxu\  = 

E\Dxu\2 


P  f +°° 

2r(V” 

P 

2r  (J) 


-!J'l''|.r|d.r 

e~^P\x\2dx 


e  H* 


-1 


d  t  = 


r(|) 


fp_1df  = 


r  (f) 

r(?) 


(2.2) 

(2.3) 


The  value  of  a  corresponds  to  the  ratio  of  L\  and  L2,  ie., 


Ei  r(2/P) 

Ve~2  v/r(3/p)r(i/p)  ' 


Table  2.1  lists  the  values  of  a  based  on  gradient  distributions  for  p  =  0.5, 1,  2.  We  analyze  the  gradient 
distribution  in  Figure  2.3  which  shows  that  the  distribution  of  image  gradient  data  matches  the  p  =  1/2 
distribution  better  than  classical  Gaussian  (p  =  2)  or  Laplacian  (p  =  1)  distribution.  This  observation 
is  consistent  with  the  choice  of  hyper-Laplacian  [4,21]  for  image  processing  (p  G  [0.5,  0.8]).  In  the 
rest  of  the  paper,  we  shall  fix  the  weighting  coefficient  a  =  1/2  to  approximate  the  desired  value  in 
Table  2.1. 
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Fig.  2.2.  Level  curves  of  different  metrics.  The  level  lines  corresponding  to  a  <  1  in  (1-4)  is  closer  to  Lq  than 
that  of  a  =  1  in  the  sense  that  the  latter  yields  0  at  both  axes. 

Table  2.1 

The  value  of  a  based  on  the  gradient  distribution. 


p 

a 

0.5 

0.5477 

1 

0.7071 

2 

0.7979 

2.1.  Numerical  algorithms.  To  solve  (1.1)  with  J(u)  defined  in  (1.4),  we  apply  the  technique 
of  difference  of  convex  algorithm  (DCA)  by  linearizing  the  isotropic  term 

Un+1  =  argmin  \\Dxu\\ i  +  \\Dyu\\i  -  a(Vu,qn)  +  fr\\Au  -  /|||  ,  (2.5) 

u  Z 


for  qn  =  (q^Qy)  =  (Dxun,  Dyun)  /  y/\Dxun\2  +  \Dyun\2  at  step  un .  Note  that  qn  is  a  point-wise 
calculation;  and  if  the  denominator  is  zero  at  some  point,  the  corresponding  qn  value  is  set  to  be  zero. 
Each  DCA  subproblem,  eq.  (2.5),  amounts  to  solving  a  TV  type  of  minimization.  We  employ  the  split 
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Fig.  2.3.  The  plot  of  log  probability  v.s.  gradient  in  comparison  with  different  distributions,  indicating  that  the 
gradient  distribution  of  a  large  natural  image  dataset  matches  L1/2  or  the  p  =  1/2  hyper- Laplacian  distribution  better 
than  classical  Gaussian  or  Laplacian  distribution. 


Bregman  technique  [15]  to  do  the  job.  Specifically,  we  introduce  two  auxiliary  variables  and  split  the 
anisotropic  term  in  the  following  way, 

uk+1  =  arg  min  \\dx\\t  +  \\dy\\i  -  a(d^  ■  q™  +  (f  ■  Qy) 

u,dx,dy 

+  f  \\Au  - /111  +  ±\\dx  -  Dxu\\l  +  ±\\dy  -  Dyu III  ,  (2.6) 

in  which  dXl  dy  can  be  updated  via  soft  shrinkage,  defined  as 

shrink(,s,7)  =  sgn(s)  max{|s|  —7,0}  .  (2.7) 

The  pseudo-code  is  summarized  in  Algorithm  1.  The  algorithm  is  efficient  for  many  applications  where 
the  matrix  to  be  inverted  is  diagonal  or  can  be  diagonalized  by  Fourier  transform,  which  is  true  for 
image  denoising,  deconvolution,  and  MRI  reconstruction. 


Algorithm  1  for  solving  unconstrained  problem  (2.5) 

Define  u  =  qx  =  qv  =  0,  z  =  f  and  Max  DC  A ,  MaxBreqman 

for  1  to  MAX  DC  A  do 

bx  by  0 

for  1  to  M  AX  Bregman  do 

u  =  ( ftATA  -  AA )~1(ftAz  +  XD^(dx  -  bx)  +  XD^(dy  -  by)) 
dx  =  shimk(Dxu  +  bx  +  aqx/X,  1/A) 
dy  =  shrink  (Dyu  +  by  +  aqy/X ,  1/A) 
bx  —  bx  T  Dxu  T  dx 

by  -  by  “b  DyU  dy 

end  for 

(QxiQy)  =  Dylt) / yJ\Dxli\^  T 

end  for 


For  the  corresponding  constrained  problem, 


min  \\Dxu\h  +  \\Dyu\\ i  -  a||Vu||2  s.t. 

II 

(2.8) 

the  DC  A  is  expressed  as 

Un+l  =  argminlllD^ulli  +  ||£>yu||i  -  a(Vu,qn) 

U 

s.t.  Au  =  /}  . 

(2.9) 
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Each  DCA  subproblem  could  be  reduced  to  a  sequence  of  unconstrained  problems  of  the  form 

Uk+ 1  =  argmin  ||£>xu|J |  +  j  Dvu  ji  -  a(Vu,qn)  +  ^ \\Au  -  zk\\\  ,  (2.10) 

u  Z 

zk+1  =  zk  +  f  -  Auk+1  .  (2.11) 

Again  the  first  equation  can  be  solved  by  the  split  Bregman  method.  Algorithm  2  for  solving  the 
constrained  problem  (2.9)  is  almost  the  same  as  Algorithm  1,  except  for  an  additional  update  on  z. 


Algorithm  2  for  solving  constrained  problem  (2.9) 

Define  u  =  qx  =  qy  =  0,  2  =  /  and  MaxDCA ,  M axBregmanlnner ,  M axBregmanOuter 

for  1  to  MAXDCA  do 

bx  -  by  —  0 

for  1  to  M axBregmanOuter  do 
for  1  to  M AX Bregmanlnner  do 

u  =  (/ 'iAT A  -  AA )~1(/aAz  +  XD^(dx  -  bx)  +  XDy  (dy  -  by)) 
dx  =  shrink  (Dxu  +  bx  +  aqx/ A,  1/A) 
dy  =  shnnk(Dyu  +  by  +  aqy/  A,  1/A) 
bx  —  bx  H-  Dxu  H-  dx 

by  -  by  H-  DyU  dy 

end  for 

z  =  z  +  f  —  Au 

end  for 

{Qx^qy)  =  (D^iq  DyU)/ yj\ Dxu\^  -1-  ~\Dyu\^ 

end  for 


2.2.  Convergence  analysis.  We  want  to  show  that  the  sequence  of  {un}  obtained  from  the 
DCA  iterations,  z.e.,  eq.  (2.5),  converges  to  a  stationary  point.  The  standard  DCA  requires  strong 
convexity  to  prove  convergence  [35],  here  we  can  get  rid  of  this  requirement  using  the  fact  that  L\  is 
convex  (not  strictly  though),  and  its  subgradient1  is  a  close  set.  We  first  prove  two  lemmas  saying 
that  the  objective  function  is  coercive  and  monotonically  non-increasing  for  the  minimizing  sequence; 
and  then  complete  the  convergence  proof. 

Lemma  2.1.  Suppose  (i  >  0,0  <  a  <  l,  and  ker(A)  f)ker(D)  =  {0},  where  D  =  [Dx\Dy\.  Then 
the  objective  function 

F(u)  :=  \\Dxu\U  +  \\Dyu\U  -  a\\yJ\Dxu\2 +  \Dyu\2\\i  +  f  II Au  -  /|||  , 


is  coercive. 

Proof  It  suffices  to  show  that  for  any  fixed  u  G  RN  \  {0},  F(^u)  -A  oo  as  7  — )>  00.  We  discuss  two 
cases  separately. 

•  If  u  £  ker(D ),  then  we  have 

F^u)  >  (1  -  a^dlDxulh  +  HA^Hi)  ^00  as  7  00  , 

since  \\y/\Dxu\2  +  |Uj/u|2||1  <  \\Dxu\\i  +  ||£>yu||i. 

•  If  u  G  ker(D ),  then  Au  ^  0,  since  ker(A)  fj  ker(D)  =  {0}  and  u  7^  0.  Therefore,  we  have 

F{lu)  >  7,hAu~  f\\l  >  f(7ll^lli  -  II/II2)  ->•  00  as  7  — >  00  . 


1We  say  a  vector  g  is  a  subgradient  of  /  at  x  G  dom(/)  if  f(z)  ^  f(x)  +  gT  (z  —  x)  for  all  z  G  dom(/). 


Anisotropic-isotropic  TV 


7 


□ 

Lemma  2.2.  If  the  sequence  {un}  is  generated  by  the  DCA  algorithm  (2.5),  i.e.; 

?/n  it 

un+l  =  argmin  ||£>x«||i  +  \\Dyu\h  -  a(  n  ,  Vu)  +  | \\Au  -  f  \\j  , 

then 

F(un)  -F{un+1)  >  0.  (2.12) 

Proof.  It  follows  from  the  first-order  optimality  condition  at  un+1  that  there  exist  p"+ 1  G 
d\\Dun+1  ||i  such  that 

pn+1  -  aqn  +  pAT(Aun+1  -  /)  =  0  .  (2.13) 

A  simple  calculation  shows  that 
F(un)  -  F(un+1) 

=  ||| A(un  -  un+l)\\l  +  p{A(un  -  un+1),Aun+1  -f)  +  ||Dun||1  -  ||Du"+1||i  -  a(||Vu"||2  -  ||Vu”+1||2) 
=  |||  A(un  -  un+l)\\l  -  (pn+1  -  uqn ,  un  -  un+1)  +  puli  -  ||Dun+1||i  -  a(||Vu"||2  -  ||Vu”+1||2) 

=  |P(w"  -  un+l)\\22  +  (||£>«n||1  -  (pn+1,un))+a(\\Vun+1\\2  -  (qn,Vun+1))  . 

The  second  equality  above  is  obtained  from  left  multiplying  (2.13)  by  ( un  —  un+1)T ,  and  third  one 
uses  the  fact  that  (pn+1  ,un+1)  =  ||i>un+1||i. 

The  chain  rule  of  subgradient  [18]  suggests  that  <9||I>u||i  =  DTd\\Du\\i,  where 

d\r\i  =  {  t-1,  ^  r  =  0  , 

1  1  |  sign(r)  otherwise  . 

It  implies  that  ||i>un||i  —  (pn+1,un)  =  ||i>un||i  —  (pn+1,Dun)  >  0,  since  pn+1  <  1  for  each  component. 
Using  the  definition  of  subgradient  and  the  fact  that  qn  E  5||Vr||2,  we  have  ||V^n+1||2  —  ( qn ,  V^n+1)  ^ 

0,  which  concludes  the  proof.  □ 

Theorem  2.3.  Under  the  assumptions  in  Lemma  2.1,  any  non-zero  limit  point  u*  of{un}  satisfies 
the  first-order  optimality  condition,  which  means  u *  is  a  stationary  point. 

Proof.  It  follows  from  Lemma  2.1  and  Lemma  2.2  that  the  objective  function  F  is  coercive, 
monotonically  decreasing,  and  convergent.  As  a  result,  the  sequence  {un}  is  bounded,  and 

||A(w"  —  un+1)||2  — >•  0  ,  (2.14) 

||Vwn+1||2  —  (qn ,X7un+1)  — >  0  .  (2.15) 

We  start  our  algorithm  from  u°  =  0,  and  u1  is  obtained  from 

u1  =  argmin  HDuIIj  +  ^\\Au-  /|||  . 

If  u 1  is  a  constant  vector,  we  stop  our  algorithm;  otherwise  F(u[ )  st  F(v)  for  any  constant  vector  v. 
Since  F  monotonically  decreases,  any  subsequent  solution  un  for  n  ^  1  is  not  constant.  Then  we  can 
properly  define 


(Vun,  Vwn+1) 
||Vixn||2||Vun+1||2  ' 
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Eq.  (2.15)  suggests  that  (1  —  cn)||Vun+1||  0.  Therefore,  cn  1.  It  follows  from  the  minimiz¬ 
ing  sequence  that  ||Vun||i  —  ||Vun+1||i  0,  and  so  V(un  —  un+1)  0.  Combining  (2.14)  and 

ker(A)  p|  ker(D)  =  {0},  we  get  un  —  un+1  0. 

This  implies  that  there  exists  a  subsequence  of  {un}  converging  to  u*,  denoted  as  {uUk}.  The 
optimality  condition  at  the  Uk~  th  step  of  DC  A  reads 

\/ q  j'ft'k  1 

0  e  d\\Dunk ||i  +  aV  •  ||ywnfc_1||2  +  ^AT (Aunk  -  /)  ,  (2-16) 


or 


\7unk-l 

■aV  • - 


-  iiAT(Aunk  -f)€  d\\Dunk  || !  . 


We  can  show  DuUk  converges  to  Du*,  as 


(2.17) 


\\Dunk  -Du*||  <  ||D||  •  \\unk  -u*||  ->0  as  nk  -»•  oo  . 


When  rik  is  sufficiently  large,  supp(Du*)  C  supp (Dunk)  and  sign(Dunfc)  =  sign(Du*).  Using  the  chain 
rule  of  subgradient,  we  have  d\Dunk\i  C  d\Du*\i,  and  DTd\Dunk\i  C  DTd\Du*\i.  Consequently,  it 
follows  from  (2.17)  that 

Ol'kl'k  1 

~v '  ||v»~-'||  “  f‘AT{Au’"  -  })  e  D|D“‘|1 ' 

We  assume  =  0  if  ||Vu||  =  0  at  some  points.  Letting  ^  oo,  we  obtain 

V?y* 

-V  •  -  MT(Au*  -f)e  D\Du*\i  , 

which  means  that  u*  satisfies  the  first-order  optimality  condition.  □ 

3.  Experiments.  We  apply  the  proposed  method  to  three  applications:  image  denoising,  decon¬ 
volution,  and  the  MRI  construction.  The  matrix  A  in  these  examples  can  be  diagonalized  by  Fourier 
transform,  and  hence  Algorithm  1  or  Algorthm  2  can  be  efficiently  implemented.  We  compare  L\  and 
L\  —  aL2  for  a  =  0.5  or  1  with  some  existing  methods,  such  as  Lq  for  image  smoothing  in  [38],  Lq 
in  [31],  Lp  for  p  =  2/3  in  [21],  and  L\  +  L\  in  [2]  for  image  deblurring.  We  use  structural  similarity 
(SSIM)  index  [37]  as  a  quantitative  measure  for  image  quality.  Let  us  first  define  local  similarity  index 
computed  on  windows  x  and  y , 


ssim(x,  y) 


(2 PxPy  T  ci)(2<jCCy  T  C2) 

(/'7  I  /'j  •  Cl )(<j2  +  0-2  +  c2)  ’ 


(3.1) 


where  ijlx,Hv  are  the  average  of  x,y,  cr2;.,  are  the  variance,  axy  is  covariance  of  x,  y,  and  ci,C2  are 
two  variables  to  stabilize  the  division  with  weak  denominator.  The  overall  SSIM  is  the  mean  of  local 
similarity  indexes,  z.e., 


SSIM(X,  Y)  :=  L  ssim (xh  Vi)  ,  (3.2) 

V  i= 1 

where  X  is  a  reference  image,  Y  is  a  distorted  one,  yi  are  corresponding  windows  indexed  by  i,  and 
N  is  the  number  of  windows.  Here  we  consider  windows  of  size  8x8. 

Image  denoising.  We  examine  the  problem  of  image  denoising  using  an  artificial  piece-wise 
constant  image  in  Figure  3.1  and  a  Lena  image  in  Figure  3.2.  We  assume  zero- mean  additive  Gaussian 
noise  with  standard  deviations  being  0.2  and  0.05  respectively.  Not  only  does  our  method  work 
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noisy,  SSIM  =  0.140 


I/O,  SSIM  =  0.148 


Li  + 1/2,  SSIM  =  0.943 


Fig.  3.1.  Denoising  results  with  comparison  to  Lq  in  [38]  and  L\  +  in  [2]. 


particularly  well  on  horizonal  or  vertical  edges  by  design,  it  can  deal  with  natural  images  as  well.  To 
verify  convergence  analysis,  the  difference  of  un  and  t/n_1  versus  iterations  is  plotted  in  logarithm 
scale  for  both  denoising  examples  shown  in  Figure  3.3,  which  shows  that  Li  —  0.51/2  converges  faster 
than  L\  —  L2.  As  the  ground-truth  is  available,  we  plot  the  relative  errors  versus  cpu  runtime  for 
Li,Li  —  L2,L\  —  O.5L2  in  Figure  3.4.  This  figure  implies  that  our  solutions  oscillate  around  the 
ground  truth  due  to  the  nonconvex  nature  of  our  model.  Additionally  we  observe  that  the  larger  a  is 
(say  approaching  1),  the  less  well-behaved  DC  A  becomes  due  to  more  weight  on  the  nonconvex  term. 
On  the  other  hand,  L\  —  L2  yields  better  results  than  L\  —  0.51/2  for  the  first  few  DC  A  iterates.  The 
denoising  results  presented  in  Figure  3.1  and  Figure  3.2  are  from  stopping  DC  A  after  2  iterations. 

Image  deblurring.  In  Figure  3.5,  a  binary  image  is  vertically  blurred  by  motion  blur  of  15  pixels 
plus  Gaussian  additive  noise  with  zero  mean  and  standard  deviation  0.1.  Our  method  outperforms  L0 
in  [31],  Lp  for  p  =  2/3  in  [21],  L\  +  L2  in  [2],  and  the  state-of-the-art  deblurring  method  BM3D  [11]. 
In  Figure  3.6,  we  present  deblurring  results  for  a  natural  image:  Cameraman.  The  original  image  is 
blurred  by  15  x  15  Gaussian  blur  whose  standard  deviation  is  1.5  plus  Gaussian  additive  noise  with 
zero  mean  and  standard  deviation  0.05.  Although  Lo,T2/3  and  BM3D  are  better  than  ours  in  terms 
of  SSIM,  their  results  have  some  ringing  artifacts.  In  both  deblurring  examples,  our  method  is  better 
than  the  classical  L\  approach.  The  relative  errors  versus  computational  time  is  plotted  in  Figure  3.7 
for  both  examples.  It  shows  similar  behavior  as  in  the  denoising  problem  that  L\  —  L2  tends  to 
worsen  beyond  certain  iterations  while  L\—  O.5L2  is  more  stable.  The  deblurring  results  presented  in 
Figure  3.5  and  Figure  3.6  are  from  stopping  DCA  after  2  and  10  iterations  for  L\  —  O.5L2  and  L\  —  L2 
respectively.  A  discussion  on  stopping  criterion  is  given  later. 

MRI  reconstruction.  In  Figure  3.8,  we  investigate  the  MRI  reconstruction  problem  using  a 
Shepp-Logan  phantom  from  7  and  8  radial  projections.  There  is  no  noise  when  we  synthesize  the  data. 
Consequently  we  adopt  the  constrained  formulation,  z.e.,  Algorithm  2  for  solving  eq.(2.9).  Due  to  the 
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noisy,  SSIM  =  0.7317 


I/0,  SSIM  =  0.875 


Li  +L%,  SSIM  =  0.913 


L i  -  0.5 L2,  SSIM  =  0.939 


Fig.  3.2.  Denoising  results  with  comparison  to  Lq  in  [38]  and  L\  +  T32  in  [2]. 


iterations 


iterations 


Fig.  3.3.  The  difference  of  un  and  un  1  versus  iterations  is  plotted  in  logarithm  scale  for  denoising  examples  in 
Figure  3.1  (left)  and  Figure  3.2  (right).  L\  —  0.5L2  converges  faster  than  L\  —  L^. 


presence  of  complex  values  in  MRI  reconstruction  problem,  SSIM  is  no  longer  applicable;  instead  we 
use  root-mean-square  (RMS)  error  to  measure  the  performance  quantitatively.  RMS  between  reference 
and  distorted  images  X,  Y  is  defined  as  RMS(X,  Y)  =  ~^=  ||X  —  Y\\2  where  M  is  the  number  of  pixels 
in  images  X,Y.  Figure  3.8  shows  that  our  method  can  get  a  perfect  reconstruction  using  only  8 
projections,  while  a  similar  work  [8]  reports  that  10  projections  are  required.  When  the  number  of 
projections  is  down  to  7,  L\  —  0.51/2  is  much  better  than  L\  and  L\  —  L 2  visually  as  well  as  in  terms 
of  RMS.  The  relative  errors  versus  cpu  time  is  plotted  in  Figure  3.9.  The  relative  errors  of  L\  —  L2 
iterations  in  the  constrained  formulation  appear  as  stable  oscillations  in  contrast  to  the  unstable 
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Fig.  3.4.  The  relative  errors  versus  runtime  for  methods  L\ ,  L\  —  L2,  L\  — O.5L2  for  denoising  examples  in  Figure  3.1 
(left)  and  Figure  3.2  (right).  Our  model  solutions  are  seen  to  oscillate  around  the  ground  truth  due  to  nonconvexity. 


oscillations  in  the  unconstrained  problems. 

4.  Discussions.  Let  us  draw  some  connections  of  this  work  to  two  existing  methods,  Lysaker- 
Osher-Tai  (LOT)  model  [24]  and  Bregman  iterations  [29].  Additionally,  we  comment  on  the  stopping 
criterion. 


4.1.  Relation  to  existing  methods.  At  first,  the  iterative  scheme  (2.6)  for  a  =  1  resembles 
the  work  of  denoising  the  normals,  proposed  by  Lysaker-Osher-Tai  [24], 

Un+1  =  argmin„  ||Vw||2  -  qn  •  Vw  +  |||Au  -  /|||  ,  (4.1) 

Vun 

where  qn  =  —  —  is  the  surface  normal.  Notice  that  the  TV  norm  in  (4.1)  is  isotropic,  while  the  first 

term  in  our  model  is  the  anisotropic  TV;  and  hence  L\  —  L2  applied  to  the  gradient  with  linearized 
1/2  term  is  different  from  the  LOT. 

On  the  other  hand,  the  LOT  model  leads  to  the  discovery  of  Bregman  iterations  [29],  which  relates 
to  the  DC  A  as  well.  Specifically,  the  Bregman  distance  [1]  based  on  a  convex  functional  J(-)  between 
two  points  u  and  v  is  defined  as 


Dj(u , v)  :=  J(u)  —  J(v)  —  (p, u  —  v)  , 


(4.2) 


where  p  G  dJ(v)  is  the  subgradient  of  J  at  the  point  v.  Osher  et.  al.  [29]  suggest  an  iterative  refinement 
procedure  to  update  u  as  follows, 


n+ 1  _ 


=  argmin  DPj  ( u,un )  +  ^\\Au-  f\\\  , 

=  argmin  J(u)  -  ( pn,u )  +  |||  Au  -  f\\\  , 


(4.3) 

(4.4) 


which  is  referred  to  as  the  Bregman  iterations.  Let  J(u)  =  ||  || 2  be  the  isotropic  TV  as  in  the  LOT 
model,  and  its  subgradient  has  the  form  —V  •  — .  Consequently,  we  rewrite  the  second  term  in  Eq. 

|Vu| 

(4.4)  as 


(-V- 


Vun  , 
|Vixn| ,U 


Vun 

V«ni 


v«)  , 


(4.5) 


which  coincides  with  the  second  term  in  the  LOT  model  (4.1). 
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original 


L0,  SSIM  =  0.879 


noisy,  SSIM  =  0.2596 


L2/ 3,  SSIM  =  0.8871 

n 


BM3D,  SSIM  =  0.917 

n 

Lx+L\,  SSIM  =  0.934 


SSIM  =  0.945  h  -  L2,  SSIM  =  0.974  Lx  -  0.5 L2,  SSIM  =  0.967 

O'!  O'!  O'! 

n  n  n 


Fig.  3.5.  Deblurring  results  with  comparison  to  Lq  in  [31],  Lp  for  p  =  2/3  in  [21],  L\  +  in  [2]  and  the 
state-of-the-art  deblurring  method  BM3D  [11]. 


Bregman  iterations  can  be  viewed  as  an  optimization  technique.  Computing  the  optimality  con¬ 
dition  for  each  subproblem  (4.4),  we  obtain 

pn+1  -  pn  +  /iAT(Aun+1  -  /)  =  0  .  (4.6) 

Summing  up  to  n  +  1,  we  have  pn+1  —  fiAT (un+1  —  zn )  for  p°  =  0  and  zn+1  =  zn  +  (/  —  Aun).  It  is 
the  optimality  condition  for  solving  un+1  from  argmin  J(u)  +  ^ \\Au  —  zn |||.  In  short,  the  Bregman 
iterations  can  be  rewritten  as 

un+1  =  argmin  J{u)  +  | \\Au  -  zn \\j  , 
zn+ 1  =  zn  +  (/  -  Aun)  . 


(4.7) 

(4.8) 
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noisy,  SSIM  =  0.5459 


BM3D,  SSIM  =  0.849 


SSIM  =  0.8317 


SSIM  =  0.81 


Li  —  0.51/2,  SSIM  =  0.828 


b  IG.  o.b.  UebLurmng  results  with  compo 
state-of-the-art  deblurring  method  BM3D  [11]. 


The  DCA  for  solving  L\  —  L 2  minimization  can  be  derived  from  a  similar  way  of  the  Bregman 
iterations.  Let  p  and  q  be  the  subgradient  of  anisotropic  Jani  and  isotropic  Jiso  respectively.  Lagging 
the  isotropic  term  gives  us 


We  apply  the  same  summation  technique  as  in  (4.6)  and  obtain 
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cpu  time  cpu  time 


Fig.  3.7.  The  relative  errors  versus  runtime  for  methods  L\,L\  —  £2,-^1  —  O.5L2  for  deblurring  examples  in 
Figure  3.5  (left)  and  Figure  3.6  (right). 


Li,  RMS  =  0.481 


L%  -L2,  RMS  =  0.384 


Lx  -0.5 L2,  RMS=0.355 


Fig.  3.8.  MRI  reconstruction  using  7  (top)  and  8  projections  (bottom).  The  root-means- error  (RMS)  is  provided 
for  comparison. 


for  p°  =  q°  =  z°  =  0.  The  subproblem  (4.10)  is  equivalent  to 

Un+1  =  argmin  Joni(u)  -a(qn,u)  +  ^\\Au  -  zn\\l  ,  (4.12) 

which  looks  very  similar  to  applying  the  DCA  for  a  constrained  problem,  eq.  (2.5).  The  algorithm 
derived  from  the  Bregman  iterations  is  summarized  in  Algorithm  3.  Its  difference  to  Algorithm  2  lies 
in  the  update  of  z  and  q.  For  Algorithm  2,  z  is  updated  M axBregmanOuter  iterations  and  then 
q  is  updated,  while  Algorithm  3  is  to  update  2)  and  q  simultaneously.  The  comparison  between  the 
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Fig.  3.9.  The  logarithm  of  relative  errors  versus  runtime  for  methods  L i,  L\  —  L2,  T\  — O.5L2  in  MRI  reconstruction 
problem  using  7  (left)  and  8  (right)  projections.  All  are  solved  under  constrained  formulation. 


Bregman  and  DCA  iterations  for  solving  such  constrained  nonconvex  problems  is  a  subject  of  further 
study. 


Algorithm  3  for  solving  constrained  problem  (2.9)  using  Bregman  method 
Define  u  =  qx  =  qy  =  0,  2  =  /  and  MaxDCA ,  MaxBregman 
for  1  to  MAXDCA  do 
bX  -  by  —  0 

for  1  to  MaxBregman  do 

u  =  ( fiAT A  —  AA )~1(gAz  +  XD^(dx  —  bx)  +  XDy  (dy  —  by)) 
dx  =  shrink(Z>xiz  +  bx  +  aqx/ A,  1/A) 
dy  =  shrink(Dy?i  +  by  +  aqy/X ,  1/A) 
bx  —  bx  +  Dxu  H-  dx 
by  =  by  +  Dyu  +  dy 
end  for 
z  =  z  +  f  —  Au 

{Qx:  qy )  =  {C)XU:  DyU) / yJ^D^U^~~\~~ [Dyt/p 

end  for 


4.2.  Stopping  criterion.  We  discuss  the  stopping  conditions  of  Algorithm  1  and  Algorithm  2 
for  unconstrained  and  constrained  problems  respectively.  Both  algorithms  have  an  outer  DCA  loop, 
which  iteratively  updates  q ,  and  inner  iterations  for  updating  u.  We  use  and  Uk  to  specify  the  outer 
and  inner  outputs  of  u. 

The  inner  loop  is  easier  to  impose  a  proper  stopping  criterion  for,  because  the  inner  loop  solves 
a  convex  subproblem.  Some  standard  stopping  criteria  are  either  the  relative  error  being  small  or 
objective  function  being  stagnant  or  both  z.e., 

\\uk+i-Uk\\  ,,  \F(uk+i)-F(uk)\  ,  . 

- n — 77 <  eu  and/or  - — — tt -  <  eF  (4.13) 

IKII  I^K)I 

with  pre-defined  tolerance  values  eu,eF-  In  this  paper,  we  choose  to  stop  the  inner  iteration  when  the 
relative  error  is  smaller  than  le-6. 

As  for  the  outer  iterations,  Figures  3.4,  3.7,  and  3.9  show  that  the  relative  error  develops  an 
oscillatory  pattern.  One  can  estimate  the  onset  time  4  of  the  oscillation  stage  of  the  error  based  on 
training  images.  In  the  denoising  (deblurring)  example,  tb  =  2  (=  10).  Hence,  a  good  stopping  time 
for  the  outer  iteration  is  at  the  end  of  an  inner  loop  when  the  cpu  time  exceeds  tb. 
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More  generally,  if  the  error  does  not  follow  a  clear  oscillatory  pattern,  one  could  inject  random 
perturbations  with  slowly  reduced  magnitudes  to  steer  away  from  unstable  stationary  points  or  direc¬ 
tions  to  help  convergence  towards  the  ground  truth  [17].  This  approach  is  closely  related  to  simulated 
annealing  [14,20]. 

5.  Conclusion.  We  proposed  a  weighted  difference  of  anisotropic  and  isotropic  total  variation  as 
a  regularization  term  for  image  processing  applications.  We  presented  a  difference  of  convex  algorithm 
(DC A)  for  both  the  constrained  and  unconstrained  formulations.  We  proved  the  convergence  of  the 
algorithm  to  ensure  that  each  limiting  point  is  a  stationary  point  and  the  values  of  the  objective  function 
monotonically  decrease.  The  behavior  of  the  iterations  was  observed  numerically  to  be  oscillatory 
around  the  ground  truth.  The  deviation  occurs  at  the  beginning  of  outer  loops  of  DCA.  A  stopping 
criterion  was  introduced  based  on  such  oscillatory  pattern  of  the  errors. 

In  the  numerical  experiments,  we  examined  three  particular  applications:  image  denoising,  deblur¬ 
ring  and  MRI  reconstruction.  By  design,  our  method  works  particularly  well  for  piecewise  constant 
images.  For  natural  images,  it  improved  the  classical  TV  model,  and  is  comparable  to  the  state-of- 
the-art  methods.  In  future  work,  we  plan  to  carry  out  a  detailed  comparison  between  the  DCA  and 
Bregman  methods,  and  further  study  the  error  pattern  and  the  resulting  stopping  criterion  for  other 
imaging  science  problems. 
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